The project goal is to analyse international migration using data obtained from the UN and World Bank data sites.
The data set contains historical data on migration for the years 1990, 1995, 2000, 2005, 2010, 2015 and 2019.
Our objectives are:
Display migration flows graphically, along with migration trends over time and the largest migration destinations and origins.
Find evidence that more developed, wealthier regions or countries attract more migrants. We could also verify whether the converse holds, meaning that less developed regions or countries have more people leaving their borders.
In terms of trend, we would like to check whether countries that become wealthier over time also tend to attract more people over time.
1. Data set-up
2. Destination and Origin data set
This is the main data set. It contains the migrant stock for several years for all countries in the world, along with some country classification information.
3. UN Development groups and World Bank’s Income group classifications of each country
This is an auxiliary data set that contains country classification by UN Development and World Bank’s Income groups.
## Parsed with column specification:
## cols(
## Index = col_double(),
## Region = col_character(),
## Notes = col_character(),
## Code = col_double(),
## TypeData = col_character()
## )
4. UN and World Bank Classifications
There are several classifications listed in the data set. They are defined in the first 20 or so rows. These are the classification / groups:
1: UN development group: More developed, Less developed, Least developed and Less developed ex-least
2: World Bank Income groups: High-, Middle-, Upper-Middle-, Lower-Middle-, Low-, No Income
3: Geographic Regions: Africa, Asia, Europe, Latin America & the Caribbean, Northern America, and Oceania.
4: Sustainable Development Goal (SDG) 7 regions: Sub-Saharan Africa, Northern Africa and Western Asia, Central and Southern Asia, Eastern and South-Eastern Asia, Latin America and the Caribbean, Oceania, and Europe and Northern America. These regions are further divided into 22 geographic subregions.
Cross-referencing is done using classifications #1, #2 and #3 above, and will be improved by adding these specific codes to each country. We also need to classify all countries by #4, as this will help when working with the data set containing the migration data.
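Attaching classification codes to countries amounts to a left join on a shared country code. The sketch below uses base R only; the two miniature tables, their column names (`Code`, `UNDev`, `WBInc`) and all values are hypothetical and stand in for the real data sets.

```r
# Hypothetical miniature tables; column names and values are illustrative only.
migration <- data.frame(
  Code        = c(4, 8, 12),
  Destination = c("Country A", "Country B", "Country C"),
  stringsAsFactors = FALSE
)
classes <- data.frame(
  Code  = c(4, 8),
  UNDev = c("Less developed", "More developed"),
  WBInc = c("Low", "Upper-Middle"),
  stringsAsFactors = FALSE
)

# Left join: keep every country, attach classifications where a code matches.
joined <- merge(migration, classes, by = "Code", all.x = TRUE)

# Countries with no match come back as NA and still need to be classified.
unclassified <- joined$Destination[is.na(joined$UNDev)]

print(joined)
print(unclassified)
```

Listing the unmatched rows after the join is a cheap way to spot countries whose codes differ between the two sources.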
5. Re-arranging classification and population data and joining datasets
The data set contains historical data on migration for the years 1990, 1995, 2000, 2005, 2010, 2015 and 2019.
Here are some hypotheses I would like to test:
1: Find evidence that more developed, wealthier regions or countries attract more migrants. We could also verify whether the converse holds, meaning that less developed regions or countries have more people leaving their borders.
2: In terms of trend, we would like to check whether countries that become wealthier over time also tend to attract more people over time.
6.1. First, let’s check the total migration destination per country in 2019
World Map
The biggest migration destination is the USA. Other major destinations include Germany, the UK, France, Canada, Russia, Australia and Saudi Arabia.
6.2 This is an interactive map, showing the main migration routes in 2019 between countries with more than 1mn migrants leaving their borders.
There are some well-known routes, like:
Mexico-China-India to USA
Algeria-Morocco to France
UK to Australia
Turkey to Germany
Others are generated by geopolitical issues, like war:
Syria to Turkey
Ukraine to Russia
6.3 Using the World Bank’s income-based classification, the chart below shows the flow of migrants between income groups in 2019.
As can be seen, High-Income countries received the highest number of migrants, including migration from within the High-Income group itself. The Upper-middle-income group seems to have more migration than the low-income group.
6.4 Aggregated migration numbers
In aggregate, international migrants are not a large share of total population: around 3% on average over the years, up from 2.9% in 1990 to 3.5% in 2019.
## # A tibble: 7 x 5
## Year count TotalMigration Population MigrPop
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1990 232 153011473 5306708000 0.0288
## 2 1995 232 161316895 5722819000 0.0282
## 3 2000 232 173588441 6121486000 0.0284
## 4 2005 232 191615574 6519162000 0.0294
## 5 2010 232 220781909 6933592000 0.0318
## 6 2015 232 248861296 7356190000 0.0338
## 7 2019 232 271642105 7689649000 0.0353
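The `MigrPop` column is simply the migrant stock divided by world population. As a quick check, the sketch below recomputes it in base R from the figures printed in the table; the data frame name `agg` is an assumption, not necessarily the object used in the project.

```r
# Figures copied from the summary table above (migrant stock and world population).
agg <- data.frame(
  Year           = c(1990, 1995, 2000, 2005, 2010, 2015, 2019),
  TotalMigration = c(153011473, 161316895, 173588441, 191615574,
                     220781909, 248861296, 271642105),
  Population     = c(5306708000, 5722819000, 6121486000, 6519162000,
                     6933592000, 7356190000, 7689649000)
)

# Share of migrants in total population, per year.
agg$MigrPop <- agg$TotalMigration / agg$Population

# Simple (unweighted) average of the seven yearly shares, in percent.
avg_share_pct <- 100 * mean(agg$MigrPop)

print(round(agg$MigrPop, 4))
print(round(avg_share_pct, 1))
```

The unweighted average treats each year equally, which is the sense in which the share is "around 3%".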
6.5 Exploring trends by different classifications.
Let’s show graphically the international migrants as a percentage of total population over time in order to check trends.
1: Sustainable Development Goal (SDG) regions
Let’s analyse the major areas of migration destination by SDG region. The plot shows the number of international migrants by area of destination.
The major area of destination is Europe and Northern America, where most of the world’s wealth is concentrated. The second largest area of destination is Northern Africa and Western Asia, where the high number of migrants could be associated with wealth driven by the oil-producing countries (and Israel) but also with geopolitical reasons, like wars and forced migration. In this group we find countries like Syria, Yemen, Turkey, and Iraq.
In terms of trends, migration has increased over time in most regions.
Let’s check the number of migrants relative to the region total population.
As in the previous charts, migrant stock relative to total population is highest in Oceania, Europe/Northern America and Northern Africa and Western Asia. As observed before, it seems to be related to wealth and to other forces.
In terms of trend, migration seems to be increasing in these three regions but is reasonably stable in the other regions.
These charts give some evidence that migration is driven by both wealth and geopolitical reasons, although at this point the evidence is based on geography only.
2: World Bank Income classification
The World Bank classifies the world’s economies into four income groups: high, upper-middle, lower-middle, and low. The assignment is based on Gross National Income (GNI) per capita (current US$), calculated using the Atlas method.
As seen before, high- and upper-middle-income countries are the groups attracting more migrants over time. Although the no-income group has the highest percentage, we cannot draw any meaningful conclusions from it, because its countries have no income classification.
6.6 Migration by country
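Let’s analyse the share of the world’s migrants hosted by each country. In the table below, TotalMigration and Population are in millions, r is the rank, and Cumul is the cumulative share of all migrants in percent.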
## Destination TotalMigration Population r Cumul
## 1 United States of America 50.661149 329.065 1 18.64996
## 2 Germany 13.132146 83.517 2 23.48432
## 3 Saudi Arabia 13.122338 34.269 3 28.31506
## 4 Russian Federation 11.640559 145.872 4 32.60032
## 5 United Kingdom 9.552110 67.530 5 36.11675
## 6 United Arab Emirates 8.587256 9.771 6 39.27799
## 7 France 8.334875 65.130 7 42.34632
## 8 Canada 7.960657 37.411 8 45.27689
## 9 Australia 7.549270 25.203 9 48.05601
## 10 Italy 6.273722 60.550 10 50.36557
## 11 Spain 6.104203 46.737 11 52.61271
## 12 Turkey 5.876829 83.430 12 54.77616
## 13 India 5.154737 1366.418 13 56.67378
## 14 Ukraine 4.964293 43.994 14 58.50129
## 15 South Africa 4.224256 58.558 15 60.05637
## 16 Kazakhstan 3.705556 18.551 16 61.42051
## 17 Thailand 3.635085 69.626 17 62.75870
## 18 Malaysia 3.430380 31.950 18 64.02153
## 19 Jordan 3.346703 10.102 19 65.25355
## 20 Pakistan 3.257978 216.565 20 66.45292
It can be seen that 20 countries host around 2/3 of all migrants in the world. Let’s analyse these twenty countries in more detail.
The USA is by far the most attractive country, representing 19% of all migrant destinations in 2019. This could be anecdotally explained by the fact that it is one of the wealthiest countries in the world, and maybe also by the fact that it has always had a big migrant population. The other countries are either wealthy or located in troubled areas, such as Turkey and Jordan.
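The `r` and `Cumul` columns in the table can be reproduced from the per-country totals and the 2019 world total. The sketch below does this in base R for the first five rows only; the figures are copied from the tables above, while the object names `top` and `world_total` are assumptions.

```r
# Top five destinations and the 2019 world total, in millions (from the tables above).
top <- data.frame(
  Destination    = c("United States of America", "Germany", "Saudi Arabia",
                     "Russian Federation", "United Kingdom"),
  TotalMigration = c(50.661149, 13.132146, 13.122338, 11.640559, 9.552110),
  stringsAsFactors = FALSE
)
world_total <- 271.642105

# Sort from largest to smallest so the cumulative sum is meaningful.
top <- top[order(-top$TotalMigration), ]

# Rank and cumulative share of the world total, in percent.
top$r     <- seq_len(nrow(top))
top$Cumul <- 100 * cumsum(top$TotalMigration) / world_total

print(top)
```

Extending `top` to all twenty rows gives the roughly two-thirds cumulative share discussed above.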
6.7 Let’s visualize in more detail the biggest migration flows between the US and its top migration sources (more than 1mn migrants).
The biggest migrant groups are from Mexico, China and India. This could be driven not only by economic reasons but also by family ties.
6.8 Let’s analyse the flow to Turkey.
The biggest flow to Turkey is from Syria, due to recent geopolitical reasons. There’s also a large migration from Bulgaria, which is less obvious.
From the visual inspections above, there’s evidence to support that the more wealth a country has, the more migrants it attracts. There’s also evidence that family ties play a role in the flow of migrants.
7.1 Let’s analyse the relationship between migration and three economic variables:
Gross National Income (GNI) per capita: as a measure of wealth - (converted to U.S. dollars using the World Bank Atlas method, divided by the midyear population)
GINI index: as a measure of inequality
Personal Remittances (received as % of GDP): as a measure of family ties
## Joining, by = c("Year", "Destination", "RegionCode", "Sub-RegionCode", "CountryCode", "UNCode", "WBCode")
7.2 Visual inspection of GNI vs Total Migration by year and by income group.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Based on six years of data, there seems to be a direct relationship between GNI and the number of migrants entering a country for countries in the upper-middle- and high-income groups, especially up to a migration size of around 2-4 mn.
7.3 Let’s plot GINI index vs total exits.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
A higher Gini index indicates greater inequality, with high income individuals receiving much larger percentages of the total income of the population.
It looks like there’s a direct relationship, when looking at exits lower than 10mn per year.
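To make the interpretation of the Gini index concrete, here is a minimal sketch of the textbook formula based on the mean absolute difference between all pairs of incomes. It is not the World Bank's estimation procedure, which works from survey data; the function name `gini` and both income vectors are hypothetical.

```r
# Gini coefficient via the mean absolute difference between all pairs of incomes.
# Returns a value between 0 and 1; the World Bank index reports it on a 0-100 scale.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

equal_incomes   <- c(25, 25, 25, 25)  # hypothetical: everyone earns the same
unequal_incomes <- c(0, 0, 0, 100)    # hypothetical: one person earns everything

print(gini(equal_incomes))
print(gini(unequal_incomes))
```

Perfect equality gives 0, and the more income is concentrated in a few hands, the closer the value gets to 1.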
7.4 Let’s check if there’s any evidence of a relationship between the amount of remittances received and total exits, as an indication of family ties, in Lower- and Low-income countries.
There is an indication of a direct relationship between remittances received and total migration, which suggests remittances can serve as a rough proxy for family ties.
Let’s build a panel data model consisting of data for six years (1990, 1995, 2000, 2005, 2010, 2015) for the three variables analysed above: GNI, Gini index and remittances received.
We will be using package plm (reference https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html)
8.1 First model - total migration received:
TotalMigration ~ GNI +Gini + PersonalRemittances
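The models in this section are fitted with `model = "within"`, the fixed-effects estimator. As a rough illustration of what that estimator does, the base R sketch below subtracts each country's own mean from the data and runs ordinary least squares on the result. The toy panel and its column names (`id`, `x`, `y`) are made up and are not the project's data.

```r
# Hypothetical toy panel: 3 countries observed in 3 periods (values are made up).
panel <- data.frame(
  id = rep(c("A", "B", "C"), each = 3),
  x  = c(1, 2, 3,  2, 4, 6,  5, 6, 8),
  y  = c(10, 12, 14,  3, 7, 11,  20, 23, 26)
)

# Within transformation: subtract each country's own mean from every variable.
panel$x_dm <- panel$x - ave(panel$x, panel$id)
panel$y_dm <- panel$y - ave(panel$y, panel$id)

# OLS on the demeaned data, without an intercept, gives the within slope.
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

# The same slope comes from OLS with one dummy variable per country (LSDV).
lsdv_fit <- lm(y ~ x + factor(id), data = panel)

print(coef(within_fit)[["x_dm"]])
print(coef(lsdv_fit)[["x"]])
```

Because only deviations from country means are used, the estimator relies entirely on changes over time within each country. Differences in level between countries do not contribute.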
## at least one couple (id-time) has NA in at least one index dimensionin resulting pdata.frame
## to find out which, use e.g.table(index(your_pdataframe), useNA = "ifany")
## TotalMigration GNI Personal Gini
## 4-1970 NA 159.7012 NA NA
## 4-1971 NA 162.8596 NA NA
## 4-1972 NA 138.3326 NA NA
## 4-1973 NA 146.2643 NA NA
## 4-1974 NA 177.4133 NA NA
## 4-1975 NA 190.5388 NA NA
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalMigration ~ GNI + Gini + Personal, data = reg,
## model = "within")
##
## Unbalanced Panel: n = 99, T = 1-6, N = 278
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.257074 -0.079223 0.000000 0.058488 2.700035
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GNI 3.1643e-05 6.0395e-06 5.2393 4.569e-07 ***
## Gini -5.2870e-03 1.6085e-02 -0.3287 0.7428
## Personal -8.4018e-03 2.3606e-02 -0.3559 0.7223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 78.07
## Residual Sum of Squares: 67.383
## R-Squared: 0.13689
## Adj. R-Squared: -0.35841
## F-statistic: 9.30472 on 3 and 176 DF, p-value: 9.6187e-06
The p-values indicate that only GNI is statistically significant.
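The summary statistics printed above can be recomputed by hand from the sums of squares, which also helps explain the negative adjusted R-squared. The sketch below assumes the usual within-model residual degrees of freedom (observations minus countries minus regressors), which matches the 176 shown in the F-statistic line. All inputs are copied from the output, and the variable names are illustrative.

```r
# Numbers copied from the model summary above.
tss <- 78.07    # total sum of squares (after the within transformation)
rss <- 67.383   # residual sum of squares
N   <- 278      # observations
n   <- 99       # countries (one fixed effect each)
k   <- 3        # regressors: GNI, Gini, Personal

df_resid <- N - n - k                         # residual degrees of freedom
r2       <- 1 - rss / tss                     # R-squared
adj_r2   <- 1 - (1 - r2) * (N - 1) / df_resid # adjusted R-squared
f_stat   <- ((tss - rss) / k) / (rss / df_resid)

# t-value and two-sided p-value for GNI from its estimate and standard error.
t_gni <- 3.1643e-05 / 6.0395e-06
p_gni <- 2 * pt(-abs(t_gni), df = df_resid)

print(c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat, t_gni = t_gni, p_gni = p_gni))
```

With 99 country effects estimated from only 278 observations, the degrees-of-freedom penalty outweighs the modest R-squared, so the adjusted value turns negative even though GNI itself is clearly significant.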
8.2 Second model - total migration exited:
## at least one couple (id-time) has NA in at least one index dimensionin resulting pdata.frame
## to find out which, use e.g.table(index(your_pdataframe), useNA = "ifany")
## Warning in cor(y, haty): the standard deviation is zero
## Warning in cor(y, haty): the standard deviation is zero
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalOrigin ~ GNI + Gini + Personal, data = reg,
## model = "within")
##
## Unbalanced Panel: n = 99, T = 1-6, N = 278
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GNI 0 0 NA NA
## Gini 0 0 NA NA
## Personal 0 0 NA NA
##
## Total Sum of Squares: 0
## Residual Sum of Squares: 0
## R-Squared: NA
## Adj. R-Squared: NA
## F-statistic: NaN on 3 and 176 DF, p-value: NA
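The model cannot be estimated. The total sum of squares is zero, which suggests that TotalOrigin does not vary over time within any country in this sample, so the within transformation leaves nothing to explain. This may be due to a lack of data or to the way the exit totals were joined to the panel.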
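The zero total and residual sums of squares in the summary above are what a fixed-effects (within) model produces when the outcome does not change over time inside any country. A minimal sketch with a made-up panel shows the symptom and a quick check for it. The names `toy`, `exits` and `gni` are hypothetical and are not the project's variables.

```r
# Hypothetical toy panel in which the outcome is constant over time within each country.
toy <- data.frame(
  id    = rep(c("A", "B"), each = 3),
  year  = rep(c(1990, 1995, 2000), times = 2),
  exits = c(5, 5, 5,  9, 9, 9),   # made-up values, no change over time
  gni   = c(1, 2, 3,  4, 6, 8)
)

# Within transformation of the outcome: deviations from each country's mean.
exits_dm <- toy$exits - ave(toy$exits, toy$id)

# Total sum of squares after demeaning; zero means there is nothing left to explain.
tss_within <- sum(exits_dm^2)

# Number of distinct outcome values per country; 1 everywhere signals the problem.
distinct_per_id <- tapply(toy$exits, toy$id, function(v) length(unique(v)))

print(tss_within)
print(distinct_per_id)
```

Running the same per-country check on the real outcome column would show whether the exit totals were accidentally repeated across years, for example by a join that did not include the year.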
8.3 Let’s filter for some countries and include only countries within the Low-Income groups.
## Warning in cor(y, haty): the standard deviation is zero
## Warning in cor(y, haty): the standard deviation is zero
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalOrigin ~ GNI + Gini + Personal, data = reg,
## model = "within")
##
## Unbalanced Panel: n = 35, T = 1-6, N = 77
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GNI 0 0 NA NA
## Gini 0 0 NA NA
## Personal 0 0 NA NA
##
## Total Sum of Squares: 0
## Residual Sum of Squares: 0
## R-Squared: NA
## Adj. R-Squared: NA
## F-statistic: NaN on 3 and 39 DF, p-value: NA
As before, the model does not work with the migration exit data.
8.4 Let’s focus on total migration received for some countries within the High- and Lower-Middle-Income groups.
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalMigration ~ GNI + Gini + Personal, data = reg,
## model = "within")
##
## Unbalanced Panel: n = 56, T = 1-6, N = 183
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.219419 -0.115321 0.000000 0.091078 2.698425
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GNI 3.0255e-05 6.4042e-06 4.7243 6.149e-06 ***
## Gini 1.4053e-02 2.4477e-02 0.5741 0.5669
## Personal -8.3934e-03 3.1954e-02 -0.2627 0.7932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 61.783
## Residual Sum of Squares: 52.275
## R-Squared: 0.1539
## Adj. R-Squared: -0.24186
## F-statistic: 7.51818 on 3 and 124 DF, p-value: 0.00011559
Again, only GNI is statistically significant.
8.5 Instead of regressing on three variables, let’s regress on only one at a time.
Gini index vs Exits
## Warning in cor(y, haty): the standard deviation is zero
## Warning in cor(y, haty): the standard deviation is zero
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalOrigin ~ Gini, data = reg, model = "within")
##
## Unbalanced Panel: n = 59, T = 1-6, N = 192
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## Gini 0 0 NA NA
##
## Total Sum of Squares: 0
## Residual Sum of Squares: 0
## R-Squared: NA
## Adj. R-Squared: NA
## F-statistic: NaN on 1 and 132 DF, p-value: NA
This is not a good model either.
Let’s try Remittances received vs Total Migration.
## at least one couple (id-time) has NA in at least one index dimensionin resulting pdata.frame
## to find out which, use e.g.table(index(your_pdataframe), useNA = "ifany")
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalMigration ~ Personal, data = reg, model = "within")
##
## Unbalanced Panel: n = 161, T = 1-6, N = 793
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.70188871 -0.04028311 -0.00092524 0.02402479 2.98367455
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## Personal -0.0012022 0.0030564 -0.3933 0.6942
##
## Total Sum of Squares: 136.82
## Residual Sum of Squares: 136.78
## R-Squared: 0.00024512
## Adj. R-Squared: -0.25484
## F-statistic: 0.15471 on 1 and 631 DF, p-value: 0.69421
The model has a high p-value, indicating that remittances received are statistically insignificant.
Let’s run again Total Migration received vs GNI.
## at least one couple (id-time) has NA in at least one index dimensionin resulting pdata.frame
## to find out which, use e.g.table(index(your_pdataframe), useNA = "ifany")
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = TotalMigration ~ GNI, data = reg, model = "within")
##
## Unbalanced Panel: n = 200, T = 2-6, N = 1181
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.6010553 -0.0662769 -0.0024686 0.0475846 3.8732824
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## GNI 1.9379e-05 2.0904e-06 9.2703 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 217.27
## Residual Sum of Squares: 199.76
## R-Squared: 0.080622
## Adj. R-Squared: -0.10701
## F-statistic: 85.9386 on 1 and 980 DF, p-value: < 2.22e-16
As indicated before, the relationship between total migration received and GNI is strongly statistically significant. This could well be because most migration is received by the wealthiest countries, as shown in the plots above.
The project consisted of showing migration flows between countries and of trying to model and estimate which variables could determine such behaviour.
It is clear from the data, at least visually, that wealth and family ties play an important part in migration flows.
Geopolitical considerations, like wars, are also not negligible.
Statistically, we could only demonstrate that GNI, as a measure of wealth, is significant in explaining migration flows. There are many more variables that could be explored, and this is left for future work.